library(tidyverse)
library(lmtest)
library(sf)
library(mapview)
library(GGally)
library(stargazer)
knitr::opts_chunk$set(echo = TRUE)
Quinn He
October 19, 2022
Rows: 13932 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): LATITUDE, LONGITUDE, PARCELNO, SALE_PRC, LND_SQFOOT, TOT_LVG_AREA,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Housing prices are difficult to predict and can shift with many variables and economic events. At the core, location is an extremely important factor in determining house prices, and likely always will be. The purpose of this project is not to break new ground in the study of house prices, but to apply multiple regression techniques to determine how important location is for single-family homes in Miami. I am particularly interested in Miami because of its proximity to the ocean and because Florida is a flat, low-elevation state. With climate change and hurricanes posing a growing threat, I want to examine Miami's housing market without climate variables included. In a future study, I would be interested to see whether my findings hold up once sea level rise and hurricane variables are added.
Previous studies have pointed to outside factors such as the unemployment rate, mortgage rates, and the stock market as determinants of house prices, but for this study location is the primary concern.
The prices of houses in Miami are influenced by their location, which includes their distance from various desirable places.
The dataset chosen for the regression analysis contains information on 13,932 single-family homes sold in Miami.
Below are the names of the columns and what each one represents:
PARCELNO: unique identifier for each property. About 1% appear multiple times.
SALE_PRC: sale price ($)
LND_SQFOOT: land area (square feet)
TOT_LVG_AREA: floor area (square feet)
SPEC_FEAT_VAL: value of special features (e.g., swimming pools) ($)
RAIL_DIST: distance to the nearest rail line (an indicator of noise) (feet)
OCEAN_DIST: distance to the ocean (feet)
WATER_DIST: distance to the nearest body of water (feet)
CNTR_DIST: distance to the Miami central business district (feet)
SUBCNTR_DI: distance to the nearest subcenter (feet)
HWY_DIST: distance to the nearest highway (an indicator of noise) (feet)
age: age of the structure
avno60plus: dummy variable for airplane noise exceeding an acceptable level
structure_quality: quality of the structure
month_sold: sale month in 2016 (1 = January)
LATITUDE
LONGITUDE
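The note that some parcel numbers repeat can be checked directly before modelling. The following is a minimal sketch on a small invented data frame (the `toy` values are hypothetical and stand in for the real data); it reports the share of rows that belong to a repeated parcel, which is one of several ways the "about 1%" figure could be defined.

```r
# Toy stand-in for the housing data; the values are invented for illustration.
toy <- data.frame(
  PARCELNO = c(101, 102, 102, 103, 104, 104, 104),
  SALE_PRC = c(250000, 310000, 325000, 410000, 199000, 205000, 210000)
)

# Count how many rows share each parcel number.
parcel_counts <- table(toy$PARCELNO)

# Parcel numbers that appear more than once (returned as character labels).
repeated_parcels <- names(parcel_counts[parcel_counts > 1])

# Share of rows that belong to a repeated parcel.
share_repeated <- mean(toy$PARCELNO %in% as.numeric(repeated_parcels))

repeated_parcels
share_repeated
```

Repeated parcels are worth knowing about because multiple sales of the same property are not independent observations in a regression.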
I rename several of the variables to lowercase, more descriptive names so they are easier to type when passing them to functions.
miami_housing <- miami_housing %>%
  rename("latitude" = "LATITUDE",
         "longitude" = "LONGITUDE",
         "sale_price" = "SALE_PRC",
         "land_sqfoot" = "LND_SQFOOT",
         "floor_sqfoot" = "TOT_LVG_AREA",
         "special_features" = "SPEC_FEAT_VAL",
         "dist_2_nearest_water" = "WATER_DIST",
         "dist_2_biz_center" = "CNTR_DIST",
         "dis_2_nearest_subcenter" = "SUBCNTR_DI",
         "home_age" = "age")
Before any model fitting and analysis, it helps to get an overall view of the data I am dealing with. The summary() function returns a large table of summary statistics for each variable. First off, the special features variable is measured in dollars: it is the value of features added to the home, be it a swimming pool, solar power, or a hot tub. The control variables will help determine how much the distance to ocean variable matters for sale price.
latitude longitude PARCELNO sale_price
Min. :25.43 Min. :-80.54 Min. :1.020e+11 Min. : 72000
1st Qu.:25.62 1st Qu.:-80.40 1st Qu.:1.079e+12 1st Qu.: 235000
Median :25.73 Median :-80.34 Median :3.040e+12 Median : 310000
Mean :25.73 Mean :-80.33 Mean :2.356e+12 Mean : 399942
3rd Qu.:25.85 3rd Qu.:-80.26 3rd Qu.:3.060e+12 3rd Qu.: 428000
Max. :25.97 Max. :-80.12 Max. :3.660e+12 Max. :2650000
land_sqfoot floor_sqfoot special_features RAIL_DIST
Min. : 1248 Min. : 854 Min. : 0 Min. : 10.5
1st Qu.: 5400 1st Qu.:1470 1st Qu.: 810 1st Qu.: 3299.4
Median : 7500 Median :1878 Median : 2766 Median : 7106.3
Mean : 8621 Mean :2058 Mean : 9562 Mean : 8348.5
3rd Qu.: 9126 3rd Qu.:2471 3rd Qu.: 12352 3rd Qu.:12102.6
Max. :57064 Max. :6287 Max. :175020 Max. :29621.5
OCEAN_DIST dist_2_nearest_water dist_2_biz_center
Min. : 236.1 Min. : 0 Min. : 3826
1st Qu.:18079.3 1st Qu.: 2676 1st Qu.: 42823
Median :28541.8 Median : 6923 Median : 65852
Mean :31691.0 Mean :11960 Mean : 68490
3rd Qu.:44310.7 3rd Qu.:19200 3rd Qu.: 89358
Max. :75744.9 Max. :50400 Max. :159976
dis_2_nearest_subcenter HWY_DIST home_age avno60plus
Min. : 1463 Min. : 90.2 Min. : 0.00 Min. :0.00000
1st Qu.: 23996 1st Qu.: 2998.1 1st Qu.:14.00 1st Qu.:0.00000
Median : 41110 Median : 6159.8 Median :26.00 Median :0.00000
Mean : 41115 Mean : 7723.8 Mean :30.67 Mean :0.01493
3rd Qu.: 53949 3rd Qu.:10854.2 3rd Qu.:46.00 3rd Qu.:0.00000
Max. :110554 Max. :48167.3 Max. :96.00 Max. :1.00000
month_sold structure_quality
Min. : 1.000 Min. :1.000
1st Qu.: 4.000 1st Qu.:2.000
Median : 7.000 Median :4.000
Mean : 6.656 Mean :3.514
3rd Qu.: 9.000 3rd Qu.:4.000
Max. :12.000 Max. :5.000

The graph indicates that homes closer to the ocean tend to sell for more than homes farther away. Around the 40,000 and 60,000 foot marks of distance there is a general spike in house prices, but I cannot determine what causes it. As stated previously, many other factors contribute to housing prices, but the graph suggests distance to the ocean is a clear predictor. I will now log-transform the model to correct for the U-shaped trend these points take.
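To make the log transformation concrete, here is a minimal sketch on synthetic data. It assumes both sale price and distance to the ocean are logged, which the text does not specify; the data frame `toy`, the coefficients used to simulate it, and the object names are all invented for illustration and are not estimates from the Miami data.

```r
# Synthetic data only: prices decay with distance to the ocean.
# The numbers below are invented and are not estimates from the Miami data.
set.seed(42)
n <- 500
ocean_dist <- runif(n, min = 500, max = 75000)  # feet
sale_price <- exp(15 - 0.3 * log(ocean_dist) + rnorm(n, sd = 0.3))
toy <- data.frame(sale_price, ocean_dist)

# Linear fit on the raw scale, for comparison.
fit_raw <- lm(sale_price ~ ocean_dist, data = toy)

# Log-log fit: the slope is read as an elasticity, i.e. the approximate
# percent change in price for a one percent change in distance.
fit_log <- lm(log(sale_price) ~ log(ocean_dist), data = toy)

coef(fit_log)
```

In this setup the log-log slope should land near the simulated value of -0.3. On real data, plotting residuals against fitted values for both fits is one way to judge whether the transformation has removed the curvature.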

I want a quick summary of how home age relates to structure quality and, in turn, to price. As I expected, homes with the lowest structure quality are on average the oldest and cheapest. For the most part, the other quality levels follow the same pattern. Structure quality 3 is interesting because its homes are the youngest and also the most expensive. I included the distance to ocean variable to tie this back to the research question, and I found that quality 3 homes are also the closest to the ocean, which may indicate that ocean proximity has a substantial impact on price.
Since newer homes with the highest structure quality are the second most expensive, structure quality also seems to play a significant role in determining house price when distance to the ocean is not considered.
# A tibble: 5 × 4
structure_quality `mean(home_age)` `mean(sale_price)` `mean(OCEAN_DIST)`
<dbl> <dbl> <dbl> <dbl>
1 1 66.1 162640. 23426.
2 2 27.6 269672. 25564.
3 3 15.1 1847250 7103.
4 4 32.0 382571. 35087.
5 5 28.8 743189. 32269.
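A table of this shape can be produced with a grouped summary. The sketch below is an assumption about the approach, not the original chunk: it runs on an invented `toy` tibble, gives the summary columns explicit names instead of the default `mean(...)` labels, and adds a hypothetical `n_homes` count so that small groups are easy to spot.

```r
library(dplyr)

# Toy stand-in for miami_housing; values are invented for illustration.
toy <- tibble(
  structure_quality = c(1, 1, 2, 2, 3, 3),
  home_age          = c(60, 70, 30, 20, 10, 20),
  sale_price        = c(150000, 170000, 250000, 290000, 1800000, 1900000),
  OCEAN_DIST        = c(20000, 26000, 24000, 27000, 6000, 8000)
)

# One row per quality level, with group means and the group size.
quality_summary <- toy %>%
  group_by(structure_quality) %>%
  summarise(
    mean_home_age   = mean(home_age),
    mean_sale_price = mean(sale_price),
    mean_ocean_dist = mean(OCEAN_DIST),
    n_homes         = n()
  )

quality_summary
```

Because summarise() drops the grouping on its single grouping variable, the result prints as a plain summary table and not as the full grouped data set.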
# A tibble: 13,932 × 17
# Groups: structure_quality [5]
latitude longitude PARCELNO sale_…¹ land_…² floor…³ speci…⁴ RAIL_…⁵ OCEAN…⁶
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 25.9 -80.2 6.22e11 440000 9375 1753 0 2816. 12811.
2 25.9 -80.2 6.22e11 349000 9375 1715 0 4359. 10648.
3 25.9 -80.2 6.22e11 800000 9375 2276 49206 4413. 10574.
4 25.9 -80.2 6.22e11 988000 12450 2058 10033 4585 10156.
5 25.9 -80.2 6.22e11 755000 12800 1684 16681 4063. 10837.
6 25.9 -80.2 6.22e11 630000 9900 1531 2978 2391. 13017
7 25.9 -80.2 6.22e11 1020000 10387 1753 23116 3277. 11668.
8 25.9 -80.2 6.22e11 850000 10272 1663 34933 3112. 11718.
9 25.9 -80.2 6.22e11 250000 9375 1493 11668 2082. 13044.
10 25.9 -80.2 6.22e11 1220000 13803 3077 34580 2938. 11918.
# … with 13,922 more rows, 8 more variables: dist_2_nearest_water <dbl>,
# dist_2_biz_center <dbl>, dis_2_nearest_subcenter <dbl>, HWY_DIST <dbl>,
# home_age <dbl>, avno60plus <dbl>, month_sold <dbl>,
# structure_quality <dbl>, and abbreviated variable names ¹sale_price,
# ²land_sqfoot, ³floor_sqfoot, ⁴special_features, ⁵RAIL_DIST, ⁶OCEAN_DIST
Error in `fortify()`:
! `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class uneval.
Did you accidentally pass `aes()` to the `data` argument?
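The error above typically appears when aes() ends up in the `data` position of ggplot(), for example when the data frame is neither piped in nor passed as the first argument. The chunk that raised it is not shown, so the sketch below only illustrates the pattern and its usual fix on an invented `toy` data frame.

```r
library(ggplot2)

# Invented values for illustration; not the Miami data.
toy <- data.frame(
  OCEAN_DIST = c(1000, 5000, 20000, 40000, 60000),
  sale_price = c(1500000, 900000, 400000, 350000, 300000)
)

# This form triggers the error, because aes() is taken as `data`:
# ggplot(aes(x = OCEAN_DIST, y = sale_price)) + geom_point()

# Usual fix: pass the data frame first, then the aesthetic mapping.
price_vs_ocean <- ggplot(toy, aes(x = OCEAN_DIST, y = sale_price)) +
  geom_point()
```

Printing `price_vs_ocean` draws the scatterplot. Piping the data in with `toy %>% ggplot(aes(...))` is equivalent, since the pipe supplies the data frame as the first argument.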
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This visualization shows the distribution of homes by sale price. Most sale prices fall below the $500,000 mark, at around $400,000 by my estimate, which is close to the mean of $399,942 in the summary statistics. A few homes sold for well over $1 million, and some for over $2 million.
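The `stat_bin()` message printed earlier suggests choosing a bin width explicitly. A minimal sketch of how that might look, on synthetic right-skewed prices: the $50,000 bin width, the axis labels, and the `toy` data are assumptions made for illustration, not settings taken from the original plot.

```r
library(ggplot2)
library(scales)

# Right-skewed synthetic prices; not the Miami data.
set.seed(1)
toy <- data.frame(
  sale_price = exp(rnorm(1000, mean = log(310000), sd = 0.6))
)

# An explicit binwidth replaces the default of 30 bins, and
# label_dollar() formats the axis ticks as currency.
price_hist <- ggplot(toy, aes(x = sale_price)) +
  geom_histogram(binwidth = 50000, boundary = 0) +
  scale_x_continuous(labels = label_dollar()) +
  labs(x = "Sale price", y = "Number of homes")
```

Printing `price_hist` draws the histogram. With dollar-valued data, a round bin width makes each bar easy to read as a price band.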
The map view is very cluttered, but it is interactive: I can click on any data point to see all the information associated with that house. For example, homes in Key Biscayne sold for about $1.6 million to $2.6 million, and most are only about 2,000 feet from the ocean. The same is true of houses in Miami Beach, Surfside, and Sunny Isles Beach.
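An interactive map like this one can be built by converting the latitude and longitude columns to point geometries and handing the result to mapview. The sketch below is an assumption about the approach and not the original chunk: the two points are invented, the coordinates are assumed to be WGS 84 (EPSG:4326), and coloring the points by `sale_price` is an illustrative choice.

```r
library(sf)

# Two invented points with roughly Miami-like coordinates.
toy <- data.frame(
  latitude   = c(25.69, 25.79),
  longitude  = c(-80.16, -80.13),
  sale_price = c(1600000, 850000)
)

# Longitude is x and latitude is y; EPSG:4326 (WGS 84) is assumed.
toy_sf <- st_as_sf(toy, coords = c("longitude", "latitude"), crs = 4326)

# Build the interactive map only if mapview is installed.
# zcol colors the points by the chosen attribute.
if (requireNamespace("mapview", quietly = TRUE)) {
  price_map <- mapview::mapview(toy_sf, zcol = "sale_price")
}
```

Printing `price_map` in an interactive session opens the map, and clicking a point shows its attributes. Coloring by price is one way to make a crowded map easier to read.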